Abstract

Most work in the area of statistical relational learning (SRL) is focussed on discrete
data, even though a few approaches for hybrid SRL models have been proposed that combine
numerical and discrete variables. In this paper we distinguish numerical random
variables for which a probability distribution is defined by the model from numerical input
variables that are only used for conditioning the distribution of discrete response variables.
We show how numerical input relations can very easily be used in the Relational
Bayesian Network framework, and that existing inference and learning methods need only
minor adjustments to be applied in this generalized setting. The resulting framework provides
natural relational extensions of classical probabilistic models for categorical data.

We demonstrate the usefulness of RBN models with numeric input relations by several 
examples. 

In particular, we use the augmented RBN framework to define probabilistic models 
for multi-relational (social) networks in which the probability of a link between two nodes 
depends on numeric latent feature vectors associated with the nodes. A generic learning 
procedure can be used to obtain a maximum-likelihood fit of model parameters and latent
feature values for a variety of models that can be expressed in the high-level RBN
representation. Specifically, we propose a model that allows us to interpret learned latent 
feature values as community centrality degrees by which we can identify nodes that are 
central for one community, that are hubs between communities, or that are isolated nodes. 

In a multi-relational setting, the model also provides a characterization of how different 
relations are associated with each community. 

1. Introduction 

Statistical-relational learning (SRL) models have mostly been developed for discrete data
(see [7, 5] for general overviews). An important reason for this is that inference
for hybrid models combining discrete and continuous variables quickly leads to
integration problems for which no closed-form solutions are available.
Among the relatively few proposals for SRL frameworks with continuous variables
are hybrid Markov Logic Networks [21], hybrid ProbLog [9], and hybrid dependency networks
[17]. In these works the complexity of the inference problem is addressed by focussing
on approximate, sampling-based methods [21, 17], or by imposing significant restrictions on
the models, so that the required integration tasks for exact inference become solvable [9].

In the first part of this paper we take a closer look at the semantic and statistical
roles that continuous variables can play in a probabilistic relational model. We arrive at
a main distinction between numeric input relations and numeric probabilistic relations,
and we argue that for many modeling and learning problems involving numeric data, only
numeric input relations are needed. We then proceed to show how numeric input relations
can be integrated into the Relational Bayesian Network (RBN) language [12], with little or
no cost in terms of algorithmic developments or computational complexity.

The second part of the paper demonstrates by several examples and applications the 
usefulness of modeling with numeric input relations, and the feasibility of the associated 
learning problems. First, a synthetic environmental modeling example shows how RBNs
with numeric input relations support natural and interpretable models that provide a relational
extension for traditional statistical models (Section 4.2).

We then turn to community detection in (social) networks as our main application. 
Utilizing a general SRL modeling language allows us to encode a variety of probabilistic 
network models on a single platform with a single generic inference and learning engine. 
We use RBNs with numeric input relations to encode probabilistic models with continuous
latent features representing community structure. The SRL framework makes it easy
to develop models for multi-relational networks (a.k.a. multiplex or multi-layer networks),
where nodes are connected by more than one type of link. In such networks, it will usually
no longer be possible to reduce community structure detection to a form of graph partitioning
[8], because different relations may define a multitude of different, overlapping, and
partly conflicting community structures. We therefore propose a latent feature model that
allows us to identify a number of communities with no restrictions on how communities are 
related in terms of inclusion or disjointness. Furthermore, for each community we obtain
a characterization of how it is defined in terms of the given network relations, and we
are able to define a probabilistic significance measure that ranks the detected communities 
in terms of their explanatory value. Last but not least, we obtain for each node in the 
network, and each community, a community centrality degree (ccd). Unlike most previously 
defined soft or fuzzy community membership degrees, these ccd values are not normalized 
to sum up to one over the different communities. They thereby allow us, for example, to 
identify influential hub nodes [22] between communities (nodes with high centrality degree 
for multiple communities). 

2. SRL Models and Numeric Relations 

An SRL model defines a probability distribution over relational structures. A model can be
instantiated over different input domains, which may consist only of a set of objects, or,
more generally, a set of objects together with a set of known input relations. Thus, an SRL
model defines conditional probability distributions

    $P(I_{R_{prob}} \mid I_{R_{in}}, D, \theta)$    (1)

where $D$ ranges over a class of domains (usually the class of all finite sets), $I_{R_{in}}$ ranges over
interpretations over $D$ of a set of input relations $R_{in}$, and $I_{R_{prob}}$ ranges over interpretations
of a set of probabilistic (or output) relations $R_{prob}$. Interpretations $I_{R_{in}}$ and $I_{R_{prob}}$ are
given as value assignments to ground atoms $r(d)$ ($d \in D^{arity(r)}$). In the discrete case,
each relation $r$ has an associated finite range of possible values. The distinction between
input and probabilistic relations need not be explicitly defined in a given model. Input
relations can also be seen as probabilistic relations that are fully instantiated as evidence
in a given inference or learning problem. Markov Logic Networks (MLNs) [18] are one
prominent framework in which there is only such an implicit distinction between input and
probabilistic relations.

2.1 Hybrid SRL Models 

Hybrid SRL models allow the introduction of numeric relations, so that atoms $r(d)$ become
real-valued variables. Based on (1), one can distinguish numerical input and numerical
probabilistic relations. To obtain a clearer view of the implications of this distinction,
consider a purely continuous, classical linear regression model:

    $P(Y \mid X, \alpha, \beta, \sigma) \sim \alpha + X \cdot \beta + N(0, \sigma^2)$    (2)

This model contains three different types of numerical variates: $Y$, the response variable, is
a random variable with a Gaussian distribution. $X$, the predictor variables, may be random
variables themselves, or they may be non-probabilistic inputs whose values have to be
instantiated before inferences about $Y$ can be made. Finally, $\alpha, \beta, \sigma$ are parameters of the
model. The functional specification (2) is completely symmetric for the predictor variables
$X$ and the parameters $\beta$. The conceptual difference between the two only becomes apparent
when one considers repeated random samples $Y_1, \ldots, Y_n$. These samples would usually be
drawn with varying values $X_1, \ldots, X_n$ for the predictor variables, whereas the parameters
$\beta$ remain constant.

In SRL, data does not usually consist of iid samples, and one learns from a single
observed pair $I_{R_{in}}, I_{R_{prob}}$. The distinction we can make in (2) between $X$ and $\beta$, therefore,
is no longer supported. That means that in (1) numeric input atoms $r(d)$ can be equally
interpreted as predictor variables, or as object-specific parameters. Neither of these views
requires defining $r(d)$ as a random variable with an associated probability distribution: as
long as (1) is used purely as a conditional model, no prior distribution for numeric input
relations is needed. Such a model will not define (posterior) probability distributions for
numerical atoms, but it still supports maximum likelihood inference for the numerical atoms,
which, depending on the interpretation of the input relations, can be seen as MPE inference
for unobserved predictor variables, or as estimation of object-specific parameters.

The clear focus of Hybrid ProbLog [9] is to introduce numeric probabilistic relations. 
The language provides constructs to explicitly define distributions of numeric atoms as 
Gaussian with specified mean and standard deviation, for example. 

The nature of Hybrid MLNs [21] is a little less clear-cut, due to the only implicit distinction
between input and output relations¹. Hybrid MLNs extend standard MLNs by
numeric properties (which we can identify with numeric relations in our terminology) from
which weighted numeric features can be constructed. Examples of weighted features that
can be included in a hybrid MLN then are

    $distance(X,Y) \qquad w_1$    (3)

    $-(length(Z) - 1.5)^2 \qquad w_2$    (4)

1. For the following discussion we assume that the reader is familiar with MLNs 
A ground instance of a weighted feature contributes a weight to a possible world $x$ (i.e. an
interpretation of all discrete and numerical relations over a given domain) that is equal to
the value of the ground feature multiplied with the weight of the feature. The probability
of $x$ then is given by $\frac{1}{Z} e^{W(x)}$, where $W(x)$ is the sum of weights from all groundings of all
features, and $Z$ is a normalization constant [21]. This definition, however, requires that a finite
normalization constant $Z$ can be found, which means that $\int e^{W(x)}\,dx$ must be finite, where the
integral represents integration over all numeric properties, and summation over all discrete
relations. This normalization is not possible, for example, for an MLN only consisting
of the weighted feature (3) with $w_1 = 1.0$. No probability distribution for the distance
property then is defined. For an MLN consisting of (4), on the other hand, normalization
is possible, and a Gaussian distribution for the ground length atoms is defined. When a
hybrid MLN contains numeric properties that prevent normalization, then no probabilistic
inference for these properties is possible, and they either have to be instantiated to perform
probabilistic inference for other properties and relations, or one has to use the model for
MPE inference tasks. In summary, hybrid MLNs support numeric probabilistic relations
under the condition that a finite normalization constant can be computed; otherwise they
support numeric input relations for which no distribution is defined.
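The two normalization cases can be made concrete with a short worked calculation. This is only a sketch under stated assumptions: a single ground atom per feature, a distance value $d$ ranging over $[0, \infty)$, and a positive weight $w_2$ (none of these are spelled out in the text above).

```latex
% Feature (3) with w_1 = 1.0: the unnormalized density e^{w_1 d} is not integrable.
\[
  \int_{0}^{\infty} e^{w_1 d} \,\mathrm{d}d = \infty \qquad (w_1 = 1.0)
\]
% Feature (4) with w_2 > 0: the unnormalized density is a Gaussian kernel.
\[
  \int_{-\infty}^{\infty} e^{-w_2 (z - 1.5)^2} \,\mathrm{d}z
  = \sqrt{\pi / w_2} < \infty \qquad (w_2 > 0)
\]
% Normalizing gives a Gaussian with mean 1.5 and variance 1/(2 w_2).
```

Under these assumptions, feature (4) alone yields a Gaussian for each ground length atom with mean 1.5 and variance $1/(2 w_2)$, whereas feature (3) alone defines no distribution.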

3. Numerical Input Relations in RBNs 
3.1 Modeling 

The RBN language is based on probability formulas that define the probability $P(r(a) = true)$
for ground relational atoms $r(a)$. The language of probability formulas is defined by
a parsimonious grammar that is based on the two main constructs of convex combinations 
and combination functions. The following are two examples of the convex combination 
construct. To improve the readability and understandability of the formulas, we here use a 
modification of the original very compact syntax of [12], and write convex combinations in 
the form of “wif-then-else” statements (“wif” stands for “weighted-if”). 


P(heads(T) = true) ← wif fair(T) then 0.5 else 0.7 (5)

P(cancer(X) = true) ← wif 0.3 then genetic-predisposition(X) else 0.1 (6)

Formula (5) defines the probability of a coin-toss to come up heads as 0.5 if a fair coin 
is tossed, and 0.7 otherwise. Here the formula in the wif-clause is an ordinary Boolean 
condition. In formula (6) the wif-clause is a numerical mixture coefficient. The formula 
thereby defines the probability that X has cancer as a mixture of a contribution coming 
from a genetic predisposition (mixture weight 0.3), and a base rate of 0.1 (mixture weight 
1-0.3). 

Generally a formula wif A then B else C is evaluated over a concrete input domain to
a probability value val(wif A then B else C) as val(A)·val(B) + (1 − val(A))·val(C). There
are two features in this modeling approach that make an integration of numerical input
relations extremely easy: first, logical input relations already are interpreted numerically:
for example, val(fair(T)) is defined as 0 or 1, depending on whether fair(T) is false or
true. Second, according to the grammar of probability formulas, numerical constants and
logical atoms are just different base cases for probability (sub-)formulas, which can be used
interchangeably in the construction of more complex formulas.
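The evaluation rule for the convex combination construct can be sketched in a few lines of Python. This is a minimal illustration, not the RBN implementation; the function names `val_wif`, `p_heads` and `p_cancer` are hypothetical and only mirror formulas (5) and (6).

```python
def val_wif(a: float, b: float, c: float) -> float:
    """Value of 'wif A then B else C', given val(A), val(B), val(C)."""
    return a * b + (1.0 - a) * c


def p_heads(fair: bool) -> float:
    # Formula (5): the Boolean atom fair(T) is evaluated as 0 or 1.
    return val_wif(1.0 if fair else 0.0, 0.5, 0.7)


def p_cancer(genetic_predisposition: bool) -> float:
    # Formula (6): the wif-clause is the numeric mixture coefficient 0.3.
    return val_wif(0.3, 1.0 if genetic_predisposition else 0.0, 0.1)
```

Because Boolean atoms and numeric constants enter `val_wif` in the same way, replacing a 0/1 value by an arbitrary real number requires no change to the evaluation rule itself.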

The generalization from Boolean to numerical relations, thus, is almost trivial: one can
just allow relational atoms $r(a)$ to evaluate to real values $val(r(a))$ in any range $[min, max]$,
where $-\infty < min < max < \infty$ depend on the intended meaning of $r$. The only additional
modification one has to make is to ensure that, in the end, probability formulas defining
the probability for a Boolean response variable return values in the interval [0,1]. We do
this by using the RBN combination function construct. Combination functions generally take
multisets of (probability) values as inputs, and return a single probability value. In particular, we can
define logistic regression as an RBN combination function. An example of an RBN model
with numeric input relations and a logistic regression combination function then is:

P(cancer(A)=true) ← COMBINE intensity(R)
                    WITH l-reg
                    FORALL R                  (7)
                    WHERE exposed(A, R)

Again we here use a slightly more verbose version of the combination function syntax
than the original one. Formula (7) defines for a person A the probability of getting cancer
using the logistic regression function applied to the set of all intensity values of radiation
sources R that A was exposed to. Thus, assuming that the numerical attribute intensity
and the logical relation exposed are known, the probability P(cancer(A)) evaluates to
$\exp(S)/(1 + \exp(S))$, where $S = \sum_{R:\, exposed(A,R)} intensity(R)$.
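As a rough illustration of formula (7), the following Python sketch applies a logistic regression combination function to the multiset of intensity values selected by the exposed relation. The data layout (a dict for `intensity`, a set of pairs for `exposed`) and all names are assumptions made for this example, not part of the RBN language.

```python
import math
from typing import Dict, Iterable, Set, Tuple


def l_reg(values: Iterable[float]) -> float:
    """Logistic regression combination function: exp(S) / (1 + exp(S)), S = sum of inputs."""
    s = sum(values)
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)  # numerically stable branch for negative S
    return e / (1.0 + e)


def p_cancer(person: str,
             intensity: Dict[str, float],
             exposed: Set[Tuple[str, str]]) -> float:
    """P(cancer(person) = true) as in formula (7)."""
    return l_reg(intensity[r] for (a, r) in exposed if a == person)
```

A person with no exposure gets the empty multiset, so S = 0 and the probability is 0.5 in this sketch; a model that needs a different base rate would add an intercept term to the combined values.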

3.2 Inference and Learning 

Probabilistic inference for RBNs with numerical input relations is no different from inference 
in a purely Boolean setting. All inference approaches that have previously been used for 
RBNs (i.e., compilation to Bayesian networks or arithmetic circuits, importance sampling) 
can still be used without modifications. 

For learning the values of numerical relations, we use a slightly generalized version 
of the gradient graph that was introduced in [13] for parameter learning in RBNs. The 
resulting likelihood graph data structure is illustrated in Figure 1. The likelihood graph is 
a computational structure related to arithmetic circuits. Each node of the graph represents 
a function of the inputs in the bottom layer of the graph: model parameters (e), values of 
ground atoms in the numerical input relations (f), and truth values of ground probabilistic 
atoms that are unobserved in the data (g). 

The topmost layer of nodes in the graph corresponds to ground probabilistic atoms that 
are instantiated in the data (a), or that are unobserved and need to be marginalized out 
for the computation of the likelihood (b) (there is a one-to-one correspondence between the 
nodes in (b) and (g)). The function associated with the ground atom nodes of this layer is 
the probability of the atom, given the current parameter settings, and instantiations of the 
ground atoms in (g). 

The nodes in the intermediate layers (c) represent sub-formulas of the probability formulas
for the ground atoms in (a) and (b). Finally, the root node represents the product
over all nodes in (a) and (b), and thus represents the likelihood of the joint configuration
of probabilistic atoms consisting of the observed values for (a), and the current setting at
the nodes (g) for the atoms in (b).

Figure 1: The likelihood graph. See text for key to (a)-(g)

The likelihood graph is constructed in a top-down manner by a recursive decomposition 
of the probability formulas. In this decomposition also sub-formulas will be encountered that 
have a constant value, and do not depend on any parameters or unobserved atoms. These 
sub-formulas are not represented explicitly by nodes in the graph and are not decomposed 
further. Their constant value is directly assimilated into the function computation at their 
parent nodes. 

The likelihood graph supports computation of the likelihood values and the gradient
of the likelihood function with respect to the numerical parameters (e) and (f). These
computations are linear in the number of edges of the graph. Based on these elementary
computations, the likelihood graph can be used for parameter learning via gradient ascent,
marginalization over unobserved atoms via MCMC sampling (mainly used when learning
from incomplete data), and MAP inference for unobserved probabilistic atoms. For parameter
learning, we perform multiple restarts of gradient ascent with random initializations
for the nodes (e), (f), (g).
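The outer learning loop described above (gradient ascent with random restarts) can be sketched as follows. This is only an illustration of the restart scheme: the likelihood graph itself is not modeled, so the log-likelihood and its gradient are passed in as plain callables, and the step size, iteration count and initialization range are arbitrary assumptions.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def gradient_ascent(loglik: Callable[[Vector], float],
                    grad: Callable[[Vector], Vector],
                    init: Vector,
                    step: float = 0.1,
                    iters: int = 500) -> Tuple[Vector, float]:
    """Plain fixed-step gradient ascent from one starting point."""
    theta = list(init)
    for _ in range(iters):
        g = grad(theta)
        theta = [t + step * gi for t, gi in zip(theta, g)]
    return theta, loglik(theta)


def fit_with_restarts(loglik: Callable[[Vector], float],
                      grad: Callable[[Vector], Vector],
                      dim: int,
                      restarts: int = 5,
                      seed: int = 0) -> Tuple[Vector, float]:
    """Run gradient ascent from several random initializations; keep the best result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        init = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        theta, ll = gradient_ascent(loglik, grad, init)
        if best is None or ll > best[1]:
            best = (theta, ll)
    return best
```

In the setting of this section, the parameter vector would collect the values at the nodes (e) and (f), and both callables would be evaluated on the likelihood graph.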

4. Examples: Standard Logistic Regression 

We now demonstrate the usefulness of RBN models with numerical input relations for 
practical modeling and learning problems, and the feasibility of learning via likelihood 
graph based gradient ascent. In this section we present examples that demonstrate the use 
of the logistic regression model in our relational framework for constructing models that 
closely follow conventional and interpretable statistical modeling approaches. 

4.1 Propositional: Cancer Data 

Figure 2: Model learned for remission probability

In a first experiment we test whether standard “propositional” logistic regression is properly
embedded in our relational framework. For this we use a very small dataset containing data
on 27 cancer patients that was originally introduced in [16], and which is often used as a
standard example for logistic regression. We use a simplified version of the dataset given
in [1, Table 5.10], which contains a single numerical predictor variable LI, and a binary
response variable indicating whether the cancer is in remission after treatment. A standard
logistic regression model for predicting remission is represented by the probability formula


P(remission(A)=true) ← COMBINE α + β · LI(A) WITH l-reg. (8)

The combination function construct in this formula is somewhat degenerate, since it
here effects no combination over a multiset of values, and simply reduces to the application
of the logistic regression function to the single number $\alpha + \beta \cdot LI(A)$.

Figure 2 shows the probability of the response variable as a function of the predictor
variable for the parameters $\alpha, \beta$ learned from the RBN encoding (8), and for the parameters
given in [1] (which were fitted using the SAS statistics toolbox). Clearly, our gradient ascent
approach using the likelihood graph here yields results that are compatible with standard
approaches to logistic regression.

4.2 Relational: Water Network 

In this section we consider a toy model for the propagation of pollution in a river network. 
This example demonstrates the ability to integrate into our relational modeling framework 
standard logistic response models based on meaningful and interpretable predictor relations. 

Input domains for this model consist of measuring stations in a river network that measure
whether the river is polluted or not. Stations are related by the Boolean upstream
relation, denoting that one station is directly upstream of another (i.e., without any other
stations in-between). For any pair of stations in the upstream relation, there also is a numerical
relation invdistance containing the inverse of the distance between the two stations.

The outermost wif-then-else construct in lines 1, 2, 10 of the model of Table 1 defines
the polluted attribute as a mixture of two factors: first, there is a base probability of
$(1 - 0.6) \cdot 0.2 = 0.08$ for pollution to occur regardless of pollution already being measured
upstream. Second, lines 2-9 contain a propagation model of pollution that is measured at
one or several upstream stations. This probability sub-formula computes the expression

    $\text{l-reg}\Bigl(\alpha + \sum_{V:\, upstream(V,S) \,\&\, polluted(V)} \beta \cdot invdistance(V,S)\Bigr).$

Figure 3: Water Network Input Domain

Table 1: Pollution model

 1. polluted(S) ← WIF 0.6
 2.               THEN COMBINE α,
 3.                            COMBINE WIF polluted(V)
 4.                                    THEN β * invdistance(V,S)
 5.                                    ELSE 0.0
 6.                            WITH sum
 7.                            FORALL V
 8.                            WHERE upstream(V,S)
 9.                    WITH l-reg
10.               ELSE 0.2;
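To make the structure of the pollution model of Table 1 concrete, the following Python sketch evaluates the probability of polluted(S) for one station, given the pollution state of its direct upstream neighbors. The data layout and the function name `p_polluted` are assumptions for illustration; the mixture weight 0.6 and base rate 0.2 are taken from Table 1.

```python
import math
from typing import Dict, Set, Tuple


def sigmoid(s: float) -> float:
    """Logistic function exp(s) / (1 + exp(s)), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def p_polluted(station: str,
               polluted: Dict[str, bool],
               upstream: Set[Tuple[str, str]],
               invdistance: Dict[Tuple[str, str], float],
               alpha: float,
               beta: float) -> float:
    """P(polluted(station) = true) following the structure of Table 1."""
    # Inner COMBINE ... WITH sum: only polluted direct upstream stations contribute.
    s = alpha + sum(beta * invdistance[(v, w)]
                    for (v, w) in upstream
                    if w == station and polluted[v])
    # Outer WIF 0.6 THEN l-reg(...) ELSE 0.2.
    return 0.6 * sigmoid(s) + (1.0 - 0.6) * 0.2
```

For a station without polluted upstream neighbors the sum is empty, so the propagation part reduces to l-reg applied to the intercept alone.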

Generalizing from the baseline example of Section 4.1 we investigate whether the parameters
of the model can still be identified from independent samples of the polluted attribute.
To this end we sample $N$ independent joint instantiations of the polluted attribute for the
12 measuring stations of the domain in Figure 3 with parameters $\alpha = -3$ and $\beta = 2$, and
the values of the invdistance relation as shown in Figure 3. All experiments are performed
using 20 random restarts. In the first experiment the values of the invdistance relation are
fixed at their true values of Figure 3, and we only learn the values of $\alpha, \beta$. Figure 4 (a)
shows the learned values for increasing sample sizes $N = 20, 50, 200, 500$. Clearly, quite
accurate estimates for $\alpha, \beta$ are already obtained from relatively small sample sizes.

In a second experiment, $\alpha, \beta$ are fixed at their true values, and the values of the invdistance
relation are learned. Figure 4 (b) shows the convergence of the estimates for
the invdistance values of five different pairs of neighboring measuring stations. Again,
the true values are consistently learned. The required sample size is much larger than
for $\alpha, \beta$, because a single sample only contains relevant information for the estimation of
$invdistance(S_i, S_j)$ when $polluted(S_i)$ is true in that sample.

Figure 4: Convergence of estimates in water network example

Table 2: Size of likelihood graph and learning times

                    N = 1000     2000     4000     8000
# nodes                 7095    14611    29331    59325
                       28995    79500   115874   318539
construction (s)        0.37     0.63     1.17     2.69
                        0.87     1.25     2.96     5.86
time/restart (s)        0.94     3.38     6.61    12.45
                       13.26    36.92    59.64   117.30

In a third experiment, both $\alpha, \beta$ and the invdistance relations are learned. In this setup
no convergence to the true values can be expected, since the parameters are not jointly
identifiable: for any given setting of the parameters $\alpha$, $\beta$, invdistance(), one obtains equivalent
solutions in the form $\alpha$, $\lambda \cdot \beta$, invdistance()/$\lambda$. For this reason, we compare the products
$\beta \cdot$ invdistance() for both the parameters in the generating and learned model. Figure 4
(c) shows these products for the same pairs of stations as in (b). The convergence here
shows that even if the exact values of the parameters cannot be learned, a probabilistically
equivalent model is learned (the learned value of the parameter $\alpha$ also converges to the true
value -3.0).
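The non-identifiability in the third experiment is easy to check numerically: rescaling the slope parameter by a factor and the invdistance values by its inverse leaves the propagation probability unchanged. The snippet below is a small sanity check with made-up invdistance values and a hypothetical helper `propagation_prob`; only the parameter values -3 and 2 come from the experiment description.

```python
import math
from typing import List


def propagation_prob(alpha: float, beta: float, invdists: List[float]) -> float:
    """l-reg(alpha + sum of beta * invdistance) over the polluted upstream stations."""
    s = alpha + sum(beta * d for d in invdists)
    return 1.0 / (1.0 + math.exp(-s))


lam = 4.0                  # arbitrary positive rescaling factor
invdists = [0.5, 1.25]     # made-up invdistance values, for illustration only

p_original = propagation_prob(-3.0, 2.0, invdists)
p_rescaled = propagation_prob(-3.0, 2.0 * lam, [d / lam for d in invdists])
```

Both calls yield the same probability, which is why only the products of the slope parameter and the invdistance values are compared between the generating and the learned model.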

Table 2 shows the size of the likelihood graph, the time for construction, and the average
time per restart for the gradient ascent optimization. For different values of the sample size
$N$, these numbers are given for the case where we only learn the relation invdistance() (top
entry in each cell of the table), and the case where we learn $\alpha, \beta$ and invdistance() (bottom
entry). The likelihood graph is significantly larger when also learning $\alpha, \beta$, because here
more sub-formulas of the instantiated model depend on unknown parameters, and therefore
cannot be pruned in the construction.

5. Application: Community Structure in Multi-Relational Networks 

We will now apply learning of numeric input relations in RBNs for community structure
analysis in multi-relational networks. Figure 5 shows a small network with 6 individuals
connected by two different types of (undirected) links. Considering only the green (solid)
link relation, one would identify {1, 3, 5} and {2, 4, 6} as communities, whereas the red
(dashed) link relation points to communities {1, 2} and {3, 4, 5, 6}. Moreover, the community
structure {1, 3, 5}, {2, 4, 6} would indicate that the red links represent an
antagonistic relationship that is more likely to exist between communities than within
communities. Considering both links simultaneously, and assuming both are positive indicators
of communities, one may also consider {3, 5} and {4, 6} as the most clearly defined
communities, to which 1 and 2 are more loosely connected.

This tiny example illustrates how multiple relations can lead to a rather complex community
landscape, with multiple possible views and interpretations. Even though multi-relational
networks occur naturally, research on community detection has very much focused
on the single-relational case. Proposals for dealing with multi-relational networks
often consist of reductions to the single-relational setting, either by aggregating all relations
into a single weighted relation [4, 15], or by aggregating results from community
detection performed for each relation separately [2].
Figure 5: Multi-relational community structure



Figure 6: Multi-relational community structure 


In Section 5.2 we will propose a latent feature model that takes all relations as input in 
a non-aggregated form, and returns multiple communities along with a characterization of 
how the different communities are correlated with the relations. 

Figure 6 shows a single-relational network with a relatively clear two-community structure.
The two nodes 3 and 4, however, are perfectly ambiguous with regard to their community
membership. Most existing soft clustering methods would give both nodes equal
membership degrees of 0.5 for both communities. However, clearly it is desirable to be 
able to distinguish node 3, which is well connected to both communities, and which for 
information diffusion purposes would be the most influential node in the network [6], from 
node 4, which is completely isolated. Nodes 1 and 2 are both very strongly associated with 
the community on the left. However, instead of assigning a membership degree close to 1.0 
to both of them, it will be more informative to assign a higher membership degree to node 
2 than to node 1, so that the membership degree also reflects the centrality of the nodes 
for the communities. In our model, the learned values of latent numeric relations can be 
interpreted as community centrality degrees, that reflect the degree of connectivity of a node 
with all communities. 

5.1 Community Centrality Degrees 

In this section we first consider the single-relational case to explore models for learning 
community membership degrees that satisfy the desiderata outlined in the preceding section. 
We introduce a numerical binary relation u(V,C ), whose arguments are a node V, and a 
community C. The relation u is constrained to be non-negative. We can then define the 
following probabilistic model: 

    $P(link(V,W)) = \begin{cases} 0 & \text{if } V = W \\ e^{S}/(1 + e^{S}) & \text{if } V \neq W \end{cases}$    (9)

where

    $S = \alpha + \sum_{C:\, community(C)} u(V,C) \cdot u(W,C)$    (10)

and $\alpha$ is a real-valued constant (the intercept, in the language of log-linear models). This
model is quite straightforward, and closely related to other models for link prediction (e.g.
for recommender systems) in which the affinity of objects to be connected by a link is
measured by the inner product of latent feature vectors associated with the objects. We
note that in contrast to structurally similar probabilistic latent semantic models [11] the
variables $u(V,C), u(W,C)$ have no semantics as (conditional) probabilities, and (9),(10) is
not a mixture model with the communities as hidden mixture components. The model is
readily encoded as an RBN. The observed links in a network with node set $\mathcal{N}$ then define
the likelihood function

    $L(\alpha, u) = \prod_{V,W \in \mathcal{N}:\, link(V,W)=true} P(link(V,W)) \prod_{V,W \in \mathcal{N}:\, link(V,W)=false} (1 - P(link(V,W)))$    (11)

The generic learning method described in Section 3.2 can be used to fit the model parameters
$\alpha \cup \{u(V,C) \mid V, C : node(V), cluster(C)\}$.

Figure 7: Zachary network
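The likelihood (11) for the inner-product model (9),(10) can be sketched directly in Python. This is an illustrative re-implementation, not the RBN encoding: it works with the log-likelihood, assumes undirected links so that each unordered pair of distinct nodes is counted once, and uses hypothetical names and data structures.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def link_prob(u_v: List[float], u_w: List[float], alpha: float) -> float:
    """P(link(V,W)) for V != W: logistic function of alpha + inner product of ccd vectors."""
    s = alpha + sum(a * b for a, b in zip(u_v, u_w))
    return 1.0 / (1.0 + math.exp(-s))


def log_likelihood(alpha: float,
                   u: Dict[int, List[float]],
                   links: Set[FrozenSet[int]]) -> float:
    """Log of the likelihood (11); u maps each node to its non-negative u(V, C) values."""
    ll = 0.0
    for v, w in combinations(sorted(u), 2):  # assumption: one term per unordered pair
        p = link_prob(u[v], u[w], alpha)
        ll += math.log(p) if frozenset((v, w)) in links else math.log(1.0 - p)
    return ll
```

Maximizing this function over the intercept and the non-negative u-values, for instance by projected gradient ascent with restarts, corresponds to the fitting task described above.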



We applied this model to the well known Zachary Karate Club network depicted in
Figure 7 (a), where the node colors represent the known “ground truth” communities in
this network [24]. Figure 7 (b) illustrates the learned $u$-values when the model is instantiated
for 2 communities $C_1, C_2$. Nodes $V$ are plotted in 2-dimensional space according to their
$u(V,C_1), u(V,C_2)$ values. Node colors still represent the ground truth. Some individual
nodes, and groups of nodes, are marked correspondingly in Figure 7 (a) and (b). We first
observe that the nodes 1 and 34 with maximal $u(\cdot,C_1)$- and $u(\cdot,C_2)$-values are central nodes
of their respective communities. In contrast, the node groups A and B are well-connected
with their own communities, but separated from the other community. C is a large group
of nodes with $u(\cdot,C_1)$- and $u(\cdot,C_2)$-values of similar magnitude. All nodes in this group
can be seen as potential hub nodes between the two communities, but node 3 with the
highest sum $u(\cdot,C_1) + u(\cdot,C_2)$ is most clearly identified as a well-connected hub between
the communities.

Two further observations are worth making. First, the fact that some $u$-values of zero
have been learned indicates that allowing negative $u$-values could lead to higher likelihood
scores. However, for the purpose of interpretability of the results, imposing the
non-negativity constraint for $u$ still seems beneficial. Second, for nodes that are pairwise
structurally indistinguishable (all nodes other than node 27 in group B) identical $u$-values
were learned. This, obviously, is highly desirable, and supports both the adequacy of the 
probabilistic model, and the effectiveness of the optimization procedure (which, starting 
from random initial values, could be feared to get stuck in local optima with non-identical 
values). 

We compare the results obtained with model (9),(10) with a slight modification of the
distance model proposed in [10]. This model is given by (9) in conjunction with

    $S = \alpha + \sum_{C:\, community(C)} (u(V,C) - u(W,C))^2.$    (12)


Thus, the log-odds of the link probability now depend on the squared Euclidean distance
between the latent feature vectors. This model, too, is readily encoded by an RBN, and the
learned $u$-values are visualized in Figure 7 (c). The positions of the nodes in the latent space
are here not interpretable as community centrality degrees, and (in line with the motivation
given by the authors for this model) are instead suitable as node coordinates for graph
visualization and plotting. The model (9),(12) also achieves a much lower log-likelihood of
-245 than the model (9),(10), which achieves a log-likelihood of -157. The baseline model
that does not contain any latent feature vectors $u$, and in which only the $\alpha$ parameter is fitted (i.e.,
a fitted Erdős-Rényi model), achieves a log-likelihood score of -452.
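
The two link models compared here can be sketched in a few lines of Python. This is only an illustration, not the Primula implementation. It assumes that (9) is the logistic link $P(link(V,W)) = 1/(1+e^{-S})$ and that (10) has the inner-product form $S = \alpha + \sum_C u(V,C) \cdot u(W,C)$; both are inferred from the surrounding text. All function names and the toy feature values are hypothetical.

```python
import math

def sigmoid(s):
    """Logistic function mapping the log-odds S to a link probability."""
    return 1.0 / (1.0 + math.exp(-s))

def s_inner_product(alpha, u_v, u_w):
    """Assumed form of (10): S = alpha + sum_C u(V,C) * u(W,C)."""
    return alpha + sum(a * b for a, b in zip(u_v, u_w))

def s_distance(alpha, u_v, u_w):
    """Form of (12): S = alpha + sum_C (u(V,C) - u(W,C))^2."""
    return alpha + sum((a - b) ** 2 for a, b in zip(u_v, u_w))

def log_likelihood(score, alpha, u, links, nodes):
    """Sum of log P(observed link status) over all unordered node pairs."""
    total = 0.0
    for i, v in enumerate(nodes):
        for w in nodes[i + 1:]:
            p = sigmoid(score(alpha, u[v], u[w]))
            linked = (v, w) in links or (w, v) in links
            total += math.log(p if linked else 1.0 - p)
    return total

# Hypothetical toy data: two communities, with node 2 acting as a hub.
nodes = [0, 1, 2, 3, 4]
u = {0: [1.5, 0.0], 1: [1.5, 0.0], 2: [1.0, 1.0], 3: [0.0, 1.5], 4: [0.0, 1.5]}
links = {(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)}

ll_inner = log_likelihood(s_inner_product, -2.0, u, links, nodes)
ll_dist = log_likelihood(s_distance, -2.0, u, links, nodes)
print(ll_inner, ll_dist)
```

Setting all $u$-values to zero reduces both scores to $S = \alpha$, which corresponds to the Erdős-Rényi baseline with a single fitted parameter.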

The likelihood graphs for the models (9),(10) and (9),(12) contain 5679 and 10235 nodes, 
respectively. The construction times for the graphs are around 0.1s and 0.8s, respectively. 
The time per restart of the learning procedure was around 31s for both models. The
increased computation time per gradient computation in the larger graph for the second 
model was offset by a smaller number of iterations required until convergence. The obtained 
results are quite robust: solutions with very similar likelihood scores and structures as the 
ones shown in Figure 7 are usually obtained as the highest-scoring solutions within 3-5 
restarts. 
Figure 8: The Wiring Room multiplex network


5.2 Communities in Multi-Relational Networks 

We now generalize the model (9),(10) to multi-relational networks. For a network containing
$K$ relations $link_i$ ($i = 1,\ldots,K$), we introduce $K$ new numeric attributes $t_i(C)$ on the cluster
objects of the domain. The values of the $t_i$ are unconstrained. The intention is that $t_i(C)$
measures whether the existence of links of type $i$ is positively ($t_i > 0$) or negatively ($t_i < 0$)
correlated with membership in cluster $C$. We now define the probability $P(link_i(V,W))$
using (9) in conjunction with

$$S_i = \alpha_i + \sum_{C:\, \mathit{community}(C)} u(V,C) \cdot u(W,C) \cdot t_i(C). \tag{13}$$


Given an observed network with $K$ different link relations, we have to fit the model parameters
$\{\alpha_i \mid i = 1,\ldots,K\} \cup \{u(V,C) \mid V,C : node(V), cluster(C)\} \cup \{t_i(C) \mid C,i : cluster(C), i = 1,\ldots,K\}$.
This model is clearly not identifiable: for a given parameterization, multiplying all
$t_i$-values by a factor $c > 0$ and dividing all $u$-values by $\sqrt{c}$ leads to an
equivalent parameterization. Absolute numeric values of the fitted parameters are therefore
not significant, but relative magnitudes of values can still identify community structure.
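
The non-identifiability can be checked numerically. The sketch below evaluates $S_i$ from (13) before and after the rescaling described above and confirms that the log-odds are unchanged. The parameter values, node names, and function names are hypothetical and chosen only for illustration.

```python
import math

def score(alpha_i, u_v, u_w, t_i):
    """S_i from (13): alpha_i + sum_C u(V,C) * u(W,C) * t_i(C)."""
    return alpha_i + sum(a * b * t for a, b, t in zip(u_v, u_w, t_i))

def rescale(u, t, c):
    """Multiply all t-values by c > 0 and divide all u-values by sqrt(c)."""
    root = math.sqrt(c)
    u_new = {v: [x / root for x in vec] for v, vec in u.items()}
    t_new = {i: [x * c for x in vec] for i, vec in t.items()}
    return u_new, t_new

# Hypothetical parameters for 3 nodes, 2 clusters, and 2 relations.
u = {"a": [0.8, 0.1], "b": [0.5, 0.9], "c": [0.0, 1.2]}
t = {1: [2.0, -1.0], 2: [-0.5, 3.0]}
alpha = {1: -1.0, 2: -2.5}

u2, t2 = rescale(u, t, c=4.0)
for i in t:
    for v, w in [("a", "b"), ("a", "c"), ("b", "c")]:
        before = score(alpha[i], u[v], u[w], t[i])
        after = score(alpha[i], u2[v], u2[w], t2[i])
        # Each product u*u*t picks up the factor (1/sqrt(c))^2 * c = 1.
        assert math.isclose(before, after, rel_tol=1e-12, abs_tol=1e-12)
```

Because each term picks up the factor $(1/\sqrt{c})^2 \cdot c = 1$, every link probability, and hence the likelihood, is the same for both parameterizations.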

We apply the model to the multi-relational wiring room network [3] depicted in Figure 8.
The network consists of 14 nodes connected by 5 distinct relations. For better
visibility, the relations are here displayed in two groups. Out of the 5 relations, 3 represent
positive relationships, 1 is antagonistic, and 1 (“arguments about opening a window”) is
potentially ambivalent. Relation 5 is directed, the others undirected. The coloring of the
nodes represents a community structure found in [3] for this network.
Using 4 clusters, we learn $u(\cdot,C_j)$-values for the 14 nodes as shown in Figure 9, and
$t_i(\cdot)$-values for the 4 communities as shown in Figure 10. The values $u(\cdot,C_1)$ (light blue
in Figure 9) identify nodes 9, 10, 11 as central nodes of community $C_1$, with which 8 and 14
are also strongly associated. This closely coincides with the original green community.
According to the $t_i(C_1)$-values of Figure 10, membership in this community is most positively
associated with relations 2 and 3, and to a lesser extent 1 and 5. Relation 4 is clearly
negatively associated with this community. Similarly, there is a good correspondence between
community $C_3$ and the original yellow community. According to the $t_i$-values, this
community is most clearly associated with relations 1 and 3. Community $C_2$ considers relation 4
as strongly positive, and thereby provides a non-standard view on the community
structure of this network, with nodes 7 and 8 as the centers of this community. Finally, community
$C_4$ also considers relation 4 as positive, but unlike for $C_2$, there is a negative association
with relation 2.

The likelihood graph here contained 7361 nodes. The reported result is the best obtained 
in 10 random restarts of the learning procedure, where one restart took about 1 minute to 
compute. 

5.3 Community Significance Measure 

It is highly desirable that a method not only returns the requested number of communities, 
but also provides a measure of the significance, or validity, of each community. On the basis 
of our probabilistic model, we obtain such a measure in terms of the explanatory value 
that a community provides for the observed network structure, where explanatory value 
is formalized by the likelihood gain obtained by including community information into the 
model. 

Specifically, to measure the explanatory value of community $C_k$, defined by the $u(\cdot,C_k)$-values
of all nodes, we consider the model given by (9) in conjunction with

$$S_i = \alpha_i + u(V,C_k) \cdot u(W,C_k) \cdot t_i(C_k). \tag{14}$$

In this model, we now keep the previously learned $u(\cdot,C_k)$ values fixed, and re-learn the
parameters $\{\alpha_i \mid i = 1,\ldots,K\} \cup \{t_i(C_k) \mid i = 1,\ldots,K\}$. As a baseline we take the
Erdős-Rényi log-likelihood obtained when only fitting the $\alpha_i$-values in a model without
communities. We then define the likelihood gain obtained from community $C_k$ as the
log-likelihood obtained by (14), minus the Erdős-Rényi baseline.
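
For a single relation, the likelihood gain can be sketched as follows. With $u(\cdot,C_k)$ fixed, (14) is a logistic regression with intercept $\alpha_i$ and one feature $x = u(V,C_k) \cdot u(W,C_k)$. The sketch assumes the logistic link for (9) and uses plain gradient ascent with backtracking rather than the learning procedure of the paper. The toy network and all names are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sigmoid(s):
    return math.exp(-softplus(-s))

def log_lik(alpha, t, data):
    """Log-likelihood of (14) for one relation; data holds (x, linked) pairs
    with the fixed feature x = u(V,C_k) * u(W,C_k)."""
    return sum(-softplus(-(alpha + t * x)) if linked else -softplus(alpha + t * x)
               for x, linked in data)

def likelihood_gain(data, iterations=500):
    """Fit alpha and t starting from the Erdos-Renyi optimum (t = 0) and
    return (gain, alpha, t). Requires a link density strictly between 0 and 1."""
    density = sum(1 for _, linked in data if linked) / len(data)
    alpha, t = math.log(density / (1.0 - density)), 0.0
    baseline = current = log_lik(alpha, t, data)
    for _ in range(iterations):
        resid = [(1.0 if linked else 0.0) - sigmoid(alpha + t * x) for x, linked in data]
        g_alpha = sum(resid)
        g_t = sum(r * x for r, (x, _) in zip(resid, data))
        step = 1.0
        while step > 1e-12:  # halve the step until the likelihood improves
            cand_alpha, cand_t = alpha + step * g_alpha, t + step * g_t
            value = log_lik(cand_alpha, cand_t, data)
            if value > current:
                alpha, t, current = cand_alpha, cand_t, value
                break
            step /= 2.0
        else:
            break  # no improving step found: treat as converged
    return current - baseline, alpha, t

# Hypothetical toy network: nodes 0-2 belong to community C_k, nodes 3-5 do not.
u_k = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
links = {(0, 1), (1, 2), (3, 4)}
data = [(u_k[v] * u_k[w], (v, w) in links) for v in range(6) for w in range(v + 1, 6)]

gain, alpha, t = likelihood_gain(data)
print(round(gain, 3), round(t, 3))
```

Since the baseline is the special case $t_i(C_k) = 0$ of (14), the gain is non-negative at the optimum. With several relations, the parameters of (14) are separate for each relation, so the total gain is the sum of the per-relation gains.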



Figure 10: Learned $t_i$-values (horizontal axis: communities $C_1$ to $C_4$)

Figure 11: Learning from sub-sampled data


For the communities identified for the Wiring Room network we obtain the following 
likelihood gain values: $C_3$: 71.4, $C_1$: 62.0, $C_2$: 39.5, $C_4$: 14.4. Thus, the ranking obtained
by the likelihood gain scores reflects quite well the intuitive evaluation of the communities 
in terms of interpretability. 

5.4 Incomplete Network Data 

An important benefit of using probabilistic models for network analysis is the ability to 
handle incomplete information: for the likelihood function (11) it is not required that for
every pair of nodes V,W the true/false status of the link relation is known. Unlike many 
other graph partitioning and community detection methods, probabilistic approaches can 
therefore easily handle incomplete graph data, where link(V,W) atoms can also have an 
unknown status. 

Apart from dealing with such potentially 3-valued graph data, we can also exploit this robustness
of the likelihood function to improve scalability to larger networks by sub-sampling
the false-link data. Assuming complete network data, the number of factors in (11), and
hence the number of nodes in the likelihood graph, is quadratic in the number of nodes
of the network. Since networks tend to be sparse, the true links are usually
greatly outnumbered by the false links, and one may expect that the community structure
is already well identified by the true links and a random sub-sample of the false links.
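
A minimal sketch of this sub-sampling step is given below, assuming an undirected, single-relation network. The function name `subsample_atoms`, its parameters, and the toy ring network are hypothetical; node pairs omitted from the result simply have unknown status and contribute no factor to the likelihood.

```python
import random
from itertools import combinations

def subsample_atoms(nodes, true_links, q, seed=0):
    """Keep all true links and a random q% of the false links.

    Returns a dict mapping unordered node pairs (frozensets) to True/False.
    Pairs absent from the dict are treated as having unknown status.
    """
    rng = random.Random(seed)
    true_set = {frozenset(edge) for edge in true_links}
    false_pairs = [pair for pair in map(frozenset, combinations(nodes, 2))
                   if pair not in true_set]
    k = round(len(false_pairs) * q / 100)
    atoms = {pair: True for pair in true_set}
    atoms.update({pair: False for pair in rng.sample(false_pairs, k)})
    return atoms

# Hypothetical sparse network: a ring of 20 nodes (20 true links, 170 false pairs).
nodes = list(range(20))
true_links = [(i, (i + 1) % 20) for i in range(20)]

atoms = subsample_atoms(nodes, true_links, q=10)
print(len(atoms), "atoms instead of", len(nodes) * (len(nodes) - 1) // 2)
```

The number of retained atoms, and with it the size of the likelihood graph, grows only with the number of true links plus the chosen fraction of false pairs.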

To investigate the effects of learning from randomly sub-sampled data, we consider a
multi-relational social network described in [19]. This network, which we call the Aarhus
network, contains 61 nodes and 5 different relations. We apply our model (9),(13) with
5 communities to data consisting of all the true links (of all 5 relations), and a random
sub-sample of $q\%$ ($q = 100, 50, 20, 10, 5$) of the false links. Having learned the $\alpha_i$, $t_i$ and $u$
parameters from sub-sampled data, we fix the learned $t_i$ and $u$ parameters, and re-learn the
$\alpha_i$ parameters using the full data. In conjunction with these adjusted $\alpha_i$ parameters, we
evaluate the likelihood score of the learned $t_i$ and $u$ parameters on the complete data. In
this manner we can assess how well the $t_i, u$ parameters learned from sub-sampled data fit
the complete data (since the learned $\alpha_i$ essentially reflect estimates for the densities of the
individual relations, these cannot fit the complete data when learned from a sub-sample in which false links
are under-represented).
Figure 12: Pearson correlations between communities


Figure 11 shows the best log-likelihood score achieved in 20 restarts each for the different
percentages of sampled false links. Surprisingly, the likelihood score at first even improves
when the data is sub-sampled. A possible explanation is a higher variance of the
scores achieved in different restarts for the larger datasets, so that the best out of 20 restarts
is further from a global optimum. Figure 11 also shows the time per restart for the
different data sets. These times follow very closely the number of atoms in the data.

We next try to determine how closely the communities identified from the sub-sampled
data resemble the communities found from the full data. For this we consider as a reference
the communities found from the 100% data in an extended sequence of 38 restarts, where
the best solution then obtained a likelihood score of -3174. Let $u_{ref}(\cdot,C_k)$ be the $u$-values
for community $C_k$ in this reference solution, and $u_q(\cdot,C_k)$ the $u$-values learned within 20
restarts from the $q\%$ data ($q = 100, 50, 20, 10, 5$). For all $q$, we compute Pearson’s correlation
between the vectors $u_{ref}(\cdot,C_k)$ and $u_q(\cdot,C_k)$ for $k = 1,\ldots,5$, and then (manually)
re-index the communities in $u_q$ to obtain the best pairwise matches (according to Pearson’s
correlation) between the $u_{ref}(\cdot,C_k)$ and $u_q(\cdot,C_k)$. Figure 12 shows a coarse heat-map visualization
of all Pearson correlations after the re-indexing. Rows in this figure correspond to
the reference communities $u_{ref}(\cdot,C_k)$ ($k = 1,\ldots,5$). The 5 main columns correspond to the
communities $u_q(\cdot,C_k)$. An element at row $k$ and column $k'$ consists of 5 colored rectangles,
representing the correlation between $u_{ref}(\cdot,C_k)$ and $u_q(\cdot,C_{k'})$, for $q = 100, 50, 20, 10, 5$ (in
this order). Dark red stands for a correlation > 0.7, medium red for > 0.5, and yellow for
> 0.3. The result shows that reference communities 1 and 4 are usually also identified from
sub-sampled data. These two communities are also the ones which are identified as the
most significant ones according to our significance measure, which evaluates to 292, 183,
130, 330, 147 for communities 1, 2, 3, 4, 5, respectively.
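
The correlation and re-indexing step can be sketched as follows. The paper performs the re-indexing manually; the brute-force search over permutations used here is a substitute that is feasible for small numbers of communities (5! = 120 permutations), and it maximizes the summed correlation, which is one possible reading of "best pairwise matches". The toy vectors and function names are hypothetical.

```python
import math
from itertools import permutations

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def best_matching(u_ref, u_q):
    """Return (perm, corr), where perm[k] is the index of the community in u_q
    matched to reference community k, maximizing the summed correlation."""
    k = len(u_ref)
    corr = [[pearson(ref, cand) for cand in u_q] for ref in u_ref]
    best = max(permutations(range(k)),
               key=lambda perm: sum(corr[i][perm[i]] for i in range(k)))
    return list(best), corr

# Hypothetical u-values: one vector per community, one entry per node.
u_ref = [[2.0, 1.5, 0.0, 0.0, 0.1],
         [0.0, 0.2, 1.8, 2.0, 0.0],
         [0.5, 0.0, 0.0, 0.3, 2.0]]
# The same communities, slightly perturbed and listed in a different order.
u_q = [[0.4, 0.1, 0.0, 0.2, 1.9],
       [1.9, 1.6, 0.1, 0.0, 0.0],
       [0.1, 0.0, 2.0, 1.7, 0.1]]

perm, corr = best_matching(u_ref, u_q)
for k, j in enumerate(perm):
    print(f"reference community {k} <-> learned community {j}: r = {corr[k][j]:.2f}")
```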

5.5 Related Work 

Probabilistic latent feature models for social networks (in the single-relational setting) have 
been proposed in [10]. The focus there, however, is more on obtaining interpretable, visual 
embeddings of the nodes in latent space, than on community analysis. 

Applying SRL modeling tools to node clustering in social network analysis has already
been suggested in [20]. Clusters there, however, consist of nodes with similar properties rather than
of connected communities of nodes. In [23] a nonparametric Bayesian model with discrete
latent variables is proposed that induces a hard partitioning of the nodes. That model is
formulated for multi-relational networks, but only applied to single-relational ones in [23].
Similarly, an RBN model with discrete latent variables for standard partitioning-based
community detection was presented in [14].

6. Conclusion 

We have identified two distinct ways in which support for numerical data can be added to
statistical relational models: as numerical probabilistic relations with an associated distribution
model, or as numeric input relations, which can also be understood as object-specific
model parameters.

We have extended the RBN framework to allow for numeric input relations. Such an
extension is particularly well supported by RBNs, because here relational (logical) atoms
have always been treated syntactically and semantically as interchangeable with numeric
parameters, and only minimal adjustments to the language and its inference and learning
algorithms are required. By also introducing a logistic regression combination function, we
obtain a framework that supports standard modelling techniques for categorical data in a
relational setting, where both models and learned parameters are interpretable. Here we
have focused on logistic regression for conditioning binary response variables on numeric
inputs, but other models could be integrated by adding further combination functions
to the language. The only requirement is that the combination functions are differentiable.

The second part of the paper applies the extended RBNs to develop new models for 
community structure analysis in social networks. Specifically, we address the challenges of 
communities in multi-relational networks, and of assigning community centrality degrees 
for node-community pairs. Unlike most kinds of community membership degrees that are 
obtained by existing soft clustering methods, these ccd’s are not fractional membership 
assignments, but measures for how well a node is connected with each community. At 
the same time, when applied to multi-relational networks, the proposed model provides an 
explanation of how communities relate to different relations, and a validity measure
for the significance of each detected community.

The RBN modeling tool provides a platform on which one can easily implement different
network models for community detection, all of which are supported by a single
generic learning algorithm. As with all general-purpose modeling and inference tools, this
generality comes at the price that for any particular model more efficient inference and
learning techniques could probably be developed by dedicated implementations that can
incorporate numerous problem-specific optimizations. Thus, should more “industrial
strength” applications of the community structure model we proposed be desired at some
point, a new model-specific implementation may be needed.

With the network models we have investigated in this paper we have stayed close to 
established models, and only scratched the surface of the modeling capabilities provided 
within the RBN language. One line of future work is to integrate these structural network 
models with dynamic models for information diffusion within the networks. 

The computational bottleneck in our implementation at this point is the size of the
likelihood graph. We have already shown that to some extent this problem can be reduced
by sub-sampling the false edges of the network. Other techniques we are currently exploring
are optimization strategies in which the likelihood function is iteratively
only partially optimized based on smaller, partial likelihood graphs. Such iterative partial
optimizations can follow a block gradient descent strategy, in which only subsets of
parameters are optimized in each iteration, a stochastic gradient descent strategy, in which
only the likelihood function of a part of the data is optimized, or a combination of both
approaches. The challenge is to develop generic strategies that are applicable
to a broad range of models, and that do not require the user to perform model-specific
tuning of the learning strategy in each case.

In principle, it would also be quite straightforward to add models for numeric probabilistic
relations to the RBN framework. The language of probability formulas can directly
be used to define also the mean and variance of a Gaussian distribution (for example), and
thereby define Gaussians that are conditioned in complex ways on continuous and categorical
predictors. However, this will come at the price of losing the tools for exact inference,
and one would have to rely on sampling-based inference methods.

All results presented in this paper are obtained using an updated version of the Primula
implementation of RBNs², which will become available with the next system release.